-
Notifications
You must be signed in to change notification settings - Fork 271
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Migrate monotonic counter metrics to u64_counter!
#6350
base: dev
Are you sure you want to change the base?
Conversation
Notes: - Fixed a typo in the not found attribute: `persisted_quieries.not_found` -> `persisted_queries.not_found`. - Added description, it would be useful for someone to check it. It reads to me like *every* request is measured?
Notes: - Removes the `apollo_router_deduplicated_subscriptions_total` metric. This is already captured by `apollo.router.operations.subscriptions` in the `subscriptions.deduplicated` attribute. - The `apollo.router.operations.batching` metric appears to use an older style of attribute naming?
Notes: - The description for `apollo_router_skipped_event_count` may not entirely be correct?
Notes: - This combined a log message and a metric: now they are separate.
✅ Docs Preview ReadyNo new or changed pages found. |
@goto-bus-stop, please consider creating a changeset entry in |
CI performance tests
|
@@ -210,7 +210,7 @@ where | |||
Response: Send + 'static + Debug, | |||
TransformedResponse: Send + 'static + Debug, | |||
{ | |||
let query = query_name::<Query>(); | |||
let query_name = query_name::<Query>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to clarify that it isn't the full query text.
mode = %"passthrough", | ||
u64_counter!( | ||
"apollo_router_deduplicated_subscriptions_total", | ||
"Total deduplicated subscription requests (deprecated)", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I marked a few apparently duplicate, or poorly named metrics as deprecated. This one appears to be a less informative duplicate of apollo.router.operations.subscriptions
, which has a subscriptions.deduplicated
attribute.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh! Good catch. It makes sense.
monotonic_counter.apollo_require_authentication_failure_count = 1u64, | ||
u64_counter!( | ||
"apollo_require_authentication_failure_count", | ||
"Number of unauthenticated requests (deprecated)", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I marked this as deprecated because the naming doesn't seem to follow convention, but I don't know if we have a proper alternative for this? We have apollo.router.operations.authorization
below but I'm not sure it reports the same thing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that's the right thing to do. cc @BrynCooke you probably have more context
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On metrics you marked as deprecated in description that could be interesting to also document it as deprecated in docs. We already have a deprecated section. I know it's also part of another ticket but we both work on deprecating different metrics so in order to not forget any of these deprecated metrics I think it's worth documenting it directly
); | ||
u64_counter!( | ||
"apollo.router.operations.jwt", | ||
"Number of requests with JWT authentication", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm wondering if we should not add authentication.jwt.failed = false
in the future (for 2.0) to be consistent with what sigv4 is doing for example
monotonic_counter.apollo_require_authentication_failure_count = 1u64, | ||
u64_counter!( | ||
"apollo_require_authentication_failure_count", | ||
"Number of unauthenticated requests (deprecated)", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that's the right thing to do. cc @BrynCooke you probably have more context
tracing::info!(monotonic_counter.apollo_router_timeout = 1u64,); | ||
u64_counter!( | ||
"apollo_router_timeout", | ||
"Number of timed out client requests", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can mark it as deprecated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The alternative is to use a custom instrument, right? So that I can document it in the deprecated metrics section
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
telemetry:
instrumentation:
instruments:
router:
http.server.request.duration:
# Adding subgraph name, response status code from the router and the operation name
attributes:
http.response.status_code: true # If status code == 504 then it's a timeout at the router http request level
graphql.operation.name:
operation_name: string
subgraph:
# Adding subgraph name, response status code from the subgraph and original operation name from the supergraph
http.client.request.duration:
attributes:
subgraph.name: true
http.response.status_code: # If status code == 504 then it's a timeout at the subgraph http request level
subgraph_response_status: code
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Custom instrumentation is an enterprise only feature, but we can also get the status code without using enterprise features:
telemetry:
instrumentation:
instruments:
default_requirement_level: required
monotonic_counter.apollo.router.graphql_error = 1u64, | ||
u64_counter!( | ||
"apollo.router.graphql_error", | ||
"Number of GraphQL error responses returned by the router", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Number of GraphQL error responses returned by the router", | |
"Number of GraphQL error responses returned by the router (DEPRECATED)", |
tracing::info!(monotonic_counter.apollo.router.graphql_error = count,); | ||
u64_counter!( | ||
"apollo.router.graphql_error", | ||
"Number of GraphQL error responses returned by the router", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Number of GraphQL error responses returned by the router", | |
"Number of GraphQL error responses returned by the router (DEPRECATED)", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah really? I thought we quite recently added more of these (though the metric name is definitely not good). Is there a replacement?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
telemetry:
instrumentation:
instruments:
router:
http.server.request.duration:
attributes:
http.response.status_code: true
graphql.operation.name:
operation_name: string
# This attribute will be set to true if the response contains graphql errors
graphql.errors: # This will be true if it contains graphql errors
on_graphql_error: true
supergraph:
supergraph.response.errors:
type: counter
value: event_unit
description: number of request containing graphql errors and the error extension code
unit: req
attributes:
graphql.operation.name: true
errors.code:
response_errors: $[0].extensions.code
condition:
exists:
response_errors: $.*.extensions.code
Here is the alternative
code = code | ||
u64_counter!( | ||
"apollo.router.graphql_error", | ||
"Number of GraphQL error responses returned by the router", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Number of GraphQL error responses returned by the router", | |
"Number of GraphQL error responses returned by the router (DEPRECATED)", |
mode = %"passthrough", | ||
u64_counter!( | ||
"apollo_router_deduplicated_subscriptions_total", | ||
"Total deduplicated subscription requests (deprecated)", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh! Good catch. It makes sense.
Co-authored-by: Coenen Benjamin <[email protected]>
…custom instrumentation
This reverts commit 7e02584.
This is part 1 out of... 4 or 5? of a series of work to move over to our new telemetry macros.
This part does the easiest part :), the
tracing::info!(monotonic_counter.)
macros that should now useu64_counter!()
. I used a lot of commits so I could take notes while doing the work, I'll copy them to PR comments so there's no need to look at each commit individually.I avoided breaking changes to the metrics for now, so this is targeted at 1.x.
Two uses of
tracing::info!(monotonic_counter.)
remain, these are already being addressed in #6338.Checklist
Complete the checklist (and note appropriate exceptions) before the PR is marked ready-for-review.
Exceptions
Note any exceptions here
Notes
Footnotes
It may be appropriate to bring upcoming changes to the attention of other (impacted) groups. Please endeavour to do this before seeking PR approval. The mechanism for doing this will vary considerably, so use your judgement as to how and when to do this. ↩
Configuration is an important part of many changes. Where applicable please try to document configuration examples. ↩
Tick whichever testing boxes are applicable. If you are adding Manual Tests, please document the manual testing (extensively) in the Exceptions. ↩